A perceptron that learns the opposite of its own output is used to generate a time series. We 
analyse properties of the weight vector and the generated sequence, like the cycle length and the 
probability distribution of generated sequences. A remarkable suppression of the autocorrelation 
function is explained, and connections to the Bernasconi model are discussed. If a continuous 
transfer function is used, the system displays chaotic and intermittent behaviour, with the product 
of the learning rate and amplification as a control parameter. 



For every prediction algorithm that maps a binary time 
series onto a binary output there is a sequence for which 
it gives 100% wrong predictions. This sequence can 
be constructed easily by having the algorithm predict 
the next bit in the time series and continuing the se- 
ries with the opposite of the prediction. Of course, this 
sequence will only make one given algorithm with one 
given set of initial parameters fail completely. However, 
it is still interesting to compare the properties of such 
an antipredictable sequence with one that can be pre- 
dicted with good success by the same algorithm. We will 
study the statistical properties of the time series gener- 
ated by one particular prediction machine, namely the 
perceptron using the Hebb learning rule. 
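The construction described above can be sketched for an arbitrary predictor. This is a minimal illustration, not the perceptron studied below: `antipredictable` and `majority_predictor` are hypothetical names, and the majority rule is only a stand-in for "any deterministic prediction algorithm".

```python
from typing import Callable, List, Sequence


def antipredictable(predict: Callable[[Sequence[int]], int],
                    seed: List[int], steps: int) -> List[int]:
    """Extend `seed` by always appending the opposite of the prediction."""
    series = list(seed)
    for _ in range(steps):
        series.append(-predict(series))  # bits are +1 / -1
    return series


def majority_predictor(history: Sequence[int]) -> int:
    """Hypothetical predictor: guess the majority of the last three bits."""
    return 1 if sum(history[-3:]) >= 0 else -1


seed = [1, -1, 1]
series = antipredictable(majority_predictor, seed, 20)
# Replay the predictor on the finished series and count its correct guesses.
hits = sum(majority_predictor(series[:t]) == series[t]
           for t in range(len(seed), len(series)))
print(hits)
```

Because the predictor is deterministic, replaying it on the finished series reproduces every prediction, and each one is wrong by construction.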

The perceptron is the simplest type of feed-forward neural network [?]. It consists of $N$ input
units that are connected to one output unit by $N$ synaptic weights $w_i$, $i = 1, \ldots, N$. An
input vector $\mathbf{x} = (x_1, \ldots, x_N)$ is mapped onto an output $\sigma$ by a sigmoidal
function of the scalar product of $\mathbf{x}$ and $\mathbf{w}$: $\sigma = f(\sum_i x_i w_i)$, where
$f(x) = \mathrm{sign}(x)$ is used for the so-called simple perceptron, and the error function
$f(x) = \mathrm{erf}(\beta x)$ or the hyperbolic tangent $\tanh(\beta x)$ with an adjustable
amplification $\beta$ are common choices for the continuous perceptron. In Section I the sequence
generated by a simple perceptron that learns the opposite of its own output is examined, whereas
in Section II a continuous perceptron is used, and the differences between the two cases are
highlighted. 



I. THE CONFUSED BIT GENERATOR 

Perceptrons have been used for generating binary time series in a simple iteration that was named
Bit Generator (BG) [?]: the pattern $\mathbf{x}^t$ at time $t$ is an $N$-bit window of a binary
time series $S$, $\mathbf{x}^t = (S^t, \ldots, S^{t-N+1})$, $S^t \in \{-1, 1\}$. The series is
generated by the output of the perceptron: $S^{t+1} = \mathrm{sign}(\mathbf{x}^t \cdot \mathbf{w})$.
For fixed $\mathbf{w}$, the sequence relaxes into a limit cycle whose average length increases more
slowly than exponentially with $N$. Short cycles with a length $l < 2N$ are more likely than longer
ones, and the Fourier spectrum of the sequence is dominated by one frequency which is also
prominent in the weights [?]. The cycles can be calculated analytically if the weights have only
one Fourier component [?]. 

In Ref. [?], a variation of the BG was introduced in which the next bit of the sequence is the
opposite of the perceptron's output, and the network learns the sequence according to the Hebb
rule with a learning rate $\eta$: 

$$S^{t+1} = -\mathrm{sign}(\mathbf{x}^t \cdot \mathbf{w}^t); \tag{1}$$
$$\mathbf{w}^{t+1} = \mathbf{w}^t + (\eta/N)\, S^{t+1}\, \mathbf{x}^t. \tag{2}$$



We call this system Confused Bit Generator (CBG), be- 
cause the perceptron is told that its output was wrong 
no matter what it predicted. 
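A minimal simulation sketch of Eqs. (1) and (2) follows. The function names, the tie-breaking convention $\mathrm{sign}(0) = +1$ and the storage order of the window (newest bit first) are assumptions made for this illustration only.

```python
import math
import random


def sign(h: float) -> int:
    # Tie-breaking at h = 0 is an assumption; it is irrelevant for generic real weights.
    return 1 if h >= 0 else -1


def norm(w):
    return math.sqrt(sum(wj * wj for wj in w))


def cbg_run(w, window, eta, steps):
    """Iterate Eqs. (1) and (2); window[j] holds S^{t-j}, newest bit first."""
    n = len(w)
    w = list(w)
    window = list(window)
    bits = []
    for _ in range(steps):
        h = sum(wj * xj for wj, xj in zip(w, window))
        s_new = -sign(h)                                              # Eq. (1)
        w = [wj + eta / n * s_new * xj for wj, xj in zip(w, window)]  # Eq. (2)
        window = [s_new] + window[:-1]                                # slide the window
        bits.append(s_new)
    return bits, w, window


rng = random.Random(0)
N, ETA = 20, 1.0
w0 = [rng.gauss(0.0, 0.1) for _ in range(N)]
x0 = [rng.choice((-1, 1)) for _ in range(N)]
bits, w, x = cbg_run(w0, x0, ETA, 5000)
print(norm(w) / ETA)
```

The printed ratio fluctuates from step to step; averaging it over many steps gives the equilibrium norm discussed in the next subsection.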



A. Dynamics of the weights 

Geometrically, $\mathbf{w}$ does a directed random walk on an $N$-dimensional cubic lattice: each
component of the learning step is $\pm\eta/N$. Thus, while the values of the weight components
$w_i$ are real numbers, they can only take discrete values $w_i^0 \pm n\eta/N$ with
$n = 0, 1, 2, \ldots$ once the initial values $w_i^0$ are chosen. 

Furthermore, each learning step has a negative overlap with the current $\mathbf{w}$, which
prevents a boundless growth of the vector. The norm of the weight vector fluctuates around an
equilibrium value that can be estimated by replacing $\mathbf{x}$ with a random vector whose
components have a variance of 1, taking the square of Eq. (2) and applying the usual formalism
for online learning [?]: 

$$\langle \mathbf{w}^{t+1}\cdot\mathbf{w}^{t+1} - \mathbf{w}^t\cdot\mathbf{w}^t \rangle = -\frac{2\eta}{N} \langle \mathbf{x}^t\cdot\mathbf{w}^t\, \mathrm{sign}(\mathbf{x}^t\cdot\mathbf{w}^t) \rangle + \frac{\eta^2}{N}. \tag{3}$$



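Introducing a time scale $\alpha$ with $d\alpha = 1/N$ and averaging over $\mathbf{x}$, this becomes
a deterministic differential equation for the norm $w$ of $\mathbf{w}$ in the thermodynamic limit
$N \to \infty$: 

$$\frac{dw}{d\alpha} = -\sqrt{\frac{2}{\pi}}\, \eta + \frac{\eta^2}{2w}. \tag{4}$$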
The attractive fixed point of this equation is $w = \sqrt{\pi/8}\, \eta \approx 0.6267\eta$.
However, using the time series generated by the perceptron as patterns, simulations give a
slightly different value of $w \approx 0.566\eta$, independent of $N$ (this was already observed
in [?]). Two possible violations of the assumptions for which the analysis in [?] guarantees
agreement with analytical predictions must be considered: first, the time series patterns
generated by the CBG do not follow a uniform distribution (see Sec. I E). Second, they are not
drawn independently from the weight vector and previous patterns. Simulations in which a
perceptron was given patterns drawn randomly from a distribution as described in Sec. I E yield
a norm $w$ that is compatible with the analytical value of $0.6267\eta$. This indicates that
temporal correlations are responsible for the deviations. 
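For reference, the analytical value quoted above follows in two short steps. The sketch assumes, as in the random-pattern approximation, that the field $h = \mathbf{x}\cdot\mathbf{w}$ is Gaussian with zero mean and variance $w^2$:

```latex
\begin{align*}
\langle h\, \mathrm{sign}(h) \rangle = \langle |h| \rangle &= \sqrt{2/\pi}\; w, \\
\frac{d(w^2)}{d\alpha} = -2\eta \sqrt{2/\pi}\; w + \eta^2 = 0
  \quad &\Longrightarrow \quad w = \sqrt{\pi/8}\; \eta \approx 0.6267\, \eta .
\end{align*}
```

In terms of $w^2$, the simulated value corresponds to a ratio $(0.566/0.6267)^2 \approx 0.82$ between observation and theory.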

The learning rate $\eta$ only sets a length scale, but does 
not influence the long-term behaviour of the system. 

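In a similar fashion, the autocorrelation of the weight vector can be calculated using the
assumption of random patterns: 

$$\langle \mathbf{w}^t \cdot \mathbf{w}^{t+\tau} \rangle = w^2 \exp\left( -\frac{4}{\pi} \frac{\tau}{N} \right). \tag{5}$$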



In some cases (see Section I C) it is useful to assign an individual learning rate $\eta_i$ to
each weight component $w_i$. A short calculation shows that the mean square norm of each weight
component is proportional to its learning rate: 

$$\langle w_i^2 \rangle \propto \eta_i. \tag{6}$$



A component with a higher learning rate thus has a 
stronger influence on the output. This also leads to a 
faster decay of the autocorrelation: 



(^H i+T > 



— - exp 

m 4 



The dynamics of the weights can be linked to the autocorrelation function $C_j^t$ of the
sequence, defined by 

$$C_j^t = \sum_{i=1}^{t} S^i S^{i-j}, \tag{8}$$

where $t$ is the number of patterns summed over. Simply add $t$ update steps according to
Eq. (2): 

$$w_j^t = w_j^0 + \sum_{i=1}^{t} (\eta/N)\, S^i S^{i-j} = w_j^0 + (\eta/N)\, C_j^t. \tag{9}$$

Each value $C_j^t$ for $1 \le j \le N$ corresponds to the distance of the weight vector from its
starting point along one axis in the $N$-dimensional weight space, measured in units of
$\eta/N$. This point is important and will be exploited in the following paragraphs. 
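The identity of Eq. (9) can be checked directly in a simulation. The sketch below is illustrative: the function names are hypothetical, the first $N$ bits are a random seed, and the autocorrelation is summed only over the bits generated afterwards.

```python
import random


def cbg_with_history(n, eta, steps, seed=0):
    """Run the CBG and keep the full bit history (first n bits are the seed)."""
    rng = random.Random(seed)
    w0 = [rng.gauss(0.0, 1.0) for _ in range(n)]
    s = [rng.choice((-1, 1)) for _ in range(n)]
    w = list(w0)
    for _ in range(steps):
        x = s[-1:-n - 1:-1]                 # x_j = S^{t-j+1}, newest bit first
        h = sum(wj * xj for wj, xj in zip(w, x))
        s_new = -1 if h >= 0 else 1         # Eq. (1)
        w = [wj + eta / n * s_new * xj for wj, xj in zip(w, x)]  # Eq. (2)
        s.append(s_new)
    return w0, w, s


def autocorrelation(s, n, j):
    """C_j of Eq. (8), summed over the generated bits (list indices n and up)."""
    return sum(s[i] * s[i - j] for i in range(n, len(s)))


N, ETA = 8, 0.5
w0, w, s = cbg_with_history(N, ETA, 500)
for j in range(1, N + 1):
    lhs = w[j - 1] - w0[j - 1]
    rhs = ETA / N * autocorrelation(s, N, j)
    print(j, round(lhs, 6), round(rhs, 6))
```

Both columns agree up to rounding, which is the statement that each $C_j^t$ with $j \le N$ is a lattice coordinate of the weight vector.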



B. Cycles and transients 

The CBG is a deterministic map with a discrete, finite state space. This means that it
eventually falls into a cycle of some length $l$: both sequence and weights repeat after $l$
steps, i.e. $\mathbf{w}^t = \mathbf{w}^{t+l}$ or alternatively $C_j^l = 0$ for $1 \le j \le N$
after $l$ steps. This means that $l$ must be divisible by 4, since only sequences with
$l \bmod 4 = 0$ can have an autocorrelation of 0. Also, a lower bound for $l$ can be given: for
the $l$th autocorrelation value one gets $C_l^l = l \neq 0$, therefore $l > N$. By renaming
indices, one finds for a periodic sequence $C_j = C_{l-j}$. If $l \le 2N$, one thus gets
$C_j = 0$ for all $j < l$. In Ref. [?], it is conjectured that such a sequence does not exist
for any $l$ except $l = 4$. If this is true, $l > 2N$ must hold for $N > 3$. 
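The cycle length can be measured for small $N$ by storing every visited state. The sketch below is an illustration under stated assumptions: $\eta = 1$, integer lattice coordinates $k_j$ are used so that states can be compared exactly, and the function name is hypothetical.

```python
import math
import random


def cbg_cycle_length(n, seed=0, max_steps=10**6):
    """Return the cycle length l of a CBG with eta = 1 and random initial conditions."""
    rng = random.Random(seed)
    w0 = [rng.gauss(0.0, 1.0) for _ in range(n)]
    scale = 0.566 / math.sqrt(sum(v * v for v in w0))   # initial norm 0.566 * eta
    w0 = [v * scale for v in w0]
    x = tuple(rng.choice((-1, 1)) for _ in range(n))    # window, newest bit first
    k = (0,) * n              # lattice coordinates: w_j = w0_j + k_j / n
    seen = {}
    for t in range(max_steps):
        state = (x, k)
        if state in seen:
            return t - seen[state]
        seen[state] = t
        h = sum((w0j + kj / n) * xj for w0j, kj, xj in zip(w0, k, x))
        s_new = -1 if h >= 0 else 1                     # Eq. (1)
        k = tuple(kj + s_new * xj for kj, xj in zip(k, x))   # Eq. (2) in lattice units
        x = (s_new,) + x[:-1]
    raise RuntimeError("no cycle found within max_steps")


for seed in range(5):
    print(seed, cbg_cycle_length(6, seed))
```

Every length found this way is a multiple of 4 and exceeds $N$, as the argument above requires; whether it also exceeds $2N$ tests the conjectured bound.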

An upper bound on the cycle length can be found by estimating how many states in weight space
the weight vector can take. Assuming that it stays inside an $N$-dimensional hypersphere of
radius $w_f = 0.566\eta$ and volume $V = w_f^N \pi^{N/2} / \Gamma(N/2 + 1)$, we can divide that
volume by the volume of a unit cell, $(\eta/N)^N$, and expand using Stirling's formula. We find
that the number of possible states in weight space scales approximately like
$5.45^N / \sqrt{N}$. Combining this with $2^N$ possible sequences gives $10.9^N / \sqrt{N}$
possible states of the system. 

Simulations show that not all of these states are part of a cycle: starting from random initial
conditions, there is a transient whose median length scales approximately like $2.04^N$. The
transient distribution (Fig. 1) shows that not all states have the same probability of being
part of a cycle: the probability for a very short transient is smaller than that for a longer
one, which implies that some sort of annealing occurs during the first steps. Simulations were
done with random initial sequences and random initial vectors normalized to $w = 0.566\eta$. 
FIG. 1. Distribution of transient lengths $t$ of a CBG with $N = 8$. The small probability of
short $t$ indicates that only a small fraction of state space is part of a cycle. 



The distribution of cycle lengths $l$ found in simulations shows the expected features (see
Fig. 2): a minimum cycle length $l > 2N$ and no cycle lengths $l$ that are not divisible by 4.
There is a distinct maximum near the minimum cycle length and a broad distribution that falls
off slightly faster than exponentially for large $l$. The average of $l$ scales approximately
like $2.2^N$, as seen in Fig. 3. The fact that the largest $l$ that is found scales
exponentially with $N$ suggests that there is an exponential number of different cycles. 
FIG. 2. Distribution of period lengths of a CBG with $N = 10$ on logarithmic and linear scales.
The initial weights and sequence were random. All $l$ are divisible by 4, and $l > 2N$. Error
bars denote the standard error. 

FIG. 3. Average, smallest and largest cycle length $l$ found in simulations with 1000 random
initial conditions for each value of $N$. The full line is an exponential fit to the data for
$N > 9$, the dotted line denotes the theoretical lower bound of $l > 2N$ for $N > 3$. 



C. Autocorrelation function and the Bernasconi 
model 

The autocorrelation of the sequence shows some peculiarities, as seen in Fig. 4. As explained,
the first $N$ values correspond to components of $\mathbf{w}$. Since $\mathbf{w}$ is finite,
$C_j^t$ is bounded for $1 \le j \le N$, i.e. it does not grow like $\sqrt{t}$ as it would for a
random sequence. The values for $N < j < 2N$ show negative correlations that grow linearly with
$t$ for even $j$, whereas they are compatible with a random sequence for odd $j$. Between $2N$
and $3N$, correlations are positive for even $j$ and 0 for odd $j$. These effects appear for all
$N$ in both the transient and the cycle as long as the cycle length is much larger than $N$. 

FIG. 4. Autocorrelation function $C_j/t$ of a CBG with $N = 50$, averaged over
$t = 2 \times 10^6$ patterns. 

Bit series with low autocorrelations are of interest in mathematics and have applications in
signal processing [?]. It is therefore interesting whether the CBG generates sequences with
autocorrelations significantly lower than for random series. Two measures are commonly used in
the literature: for periodic sequences of length $l$, an energy function (which is studied in
the so-called Bernasconi model for periodic boundary conditions [10]) can be defined by 

$$H_p = \sum_{j=1}^{l-1} (C_j)^2 = \sum_{j=1}^{l-1} \left( \sum_{i=1}^{l} S^i S^{i-j} \right)^2. \tag{10}$$



Results on the ground states of this Hamiltonian can be found in [8]. By trial and error,
initial conditions for the CBG can be found which yield cycles slightly larger than $2N$, for
which all values of $C_j$ except one are 0. However, even for the best sequences we found,
$H_p$ was larger than the known ground state energies by at least a factor of 2. 

The original model does not use periodic boundary conditions: in a sequence of length $p$, only
the sum over $p - j$ different terms with a lag of $j$ can be calculated. The energy is
therefore given by 

$$H_{ap} = \sum_{j=1}^{p-1} \left( C_j^{p-j} \right)^2 \tag{11}$$

(note the summation limits). The so-called merit factor $F$ introduced by Golay [11] is defined
by 

$$F = \frac{p^2}{2 H_{ap}}. \tag{12}$$



A merit factor of 1 is expected for a random sequence; lower autocorrelations yield higher $F$.
The theoretical limit for large $p$ is conjectured to be about $F = 12$ [10], whereas
optimization routines typically find sequences with $5 < F < 9$ (see [?] and references
therein) and exact enumeration for small $p$ suggests $\lim_{p \to \infty} F = 9.3$ for the
optimal sequence [?]. 
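Eqs. (11) and (12) translate directly into code. In the sketch below the function names are hypothetical, and the Barker sequence of length 13 is used only as a familiar test case; it is not taken from the text.

```python
def aperiodic_autocorrelation(s, j):
    """C_j summed over the p - j available pairs with lag j."""
    return sum(s[i] * s[i + j] for i in range(len(s) - j))


def energy(s):
    """H_ap of Eq. (11)."""
    return sum(aperiodic_autocorrelation(s, j) ** 2 for j in range(1, len(s)))


def merit_factor(s):
    """F of Eq. (12)."""
    return len(s) ** 2 / (2 * energy(s))


barker13 = [1, 1, 1, 1, 1, -1, -1, 1, 1, -1, 1, -1, 1]
print(energy(barker13), merit_factor(barker13))
```

For this sequence all off-peak autocorrelations are 0 or 1, so $H_{ap} = 6$ and $F = 169/12 \approx 14.1$, far above the value of 1 expected for a random sequence.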

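To estimate the merit factor of sequences generated by the CBG analytically, we solve Eq. (9)
for $(C_j)^2$ and use the autocorrelation of the weights given by Eq. (5): 

$$\langle (C_j^{p-j})^2 \rangle = \frac{N^2}{\eta^2} \langle (w_j^{p-j})^2 + (w_j^0)^2 - 2 w_j^{p-j} w_j^0 \rangle = \frac{\pi}{4} N \left( 1 - \exp\left( -\frac{4}{\pi} \frac{p-j}{N} \right) \right). \tag{13}$$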



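The energy can be expressed as a sum or approximated by an integral in continuous variables
$\alpha = p/N$ and $\beta = j/N$. Since Eq. (13) only holds for $1 \le j \le N$,
$\langle (C_j^{p-j})^2 \rangle = p - j$ must be used for $j > N$. We get the expression 

$$H_{ap} = \sum_{j=1}^{p} \frac{\pi}{4} N \left( 1 - \exp\left( -\frac{4}{\pi} \frac{p-j}{N} \right) \right) \approx \frac{\pi}{4} N^2 \int_0^{\alpha} \left( 1 - \exp\left( -\frac{4}{\pi} (\alpha - \beta) \right) \right) d\beta = N^2\, \frac{\pi}{4} \left( \alpha - \frac{\pi}{4} \left( 1 - \exp\left( -\frac{4}{\pi} \alpha \right) \right) \right) \quad \text{for } p \le N \text{ and} \tag{14}$$

$$H_{ap} = N^2 \left[ \frac{\pi}{4} \left( 1 - \frac{\pi}{4} \left( \exp\left( -\frac{4}{\pi} (\alpha - 1) \right) - \exp\left( -\frac{4}{\pi} \alpha \right) \right) \right) + \frac{1}{2} (\alpha - 1)^2 \right] \quad \text{for } p > N. \tag{15}$$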



The corresponding merit factor is compared to simulations in Fig. 5: Eqs. (14) and (15) give
qualitatively correct results, but differ from the observed values by roughly 10%. The feedback
mechanisms of the CBG cause a faster decay of $C_j$ than predicted for random patterns. 

FIG. 5. Merit factor $F$ as a function of scaled sequence length $p/N$, compared to Eq. (15).
Error bars denote the standard deviation of $F$ for $N = 100$, not the standard error. 
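The random-pattern prediction for the merit factor can be evaluated numerically from Eqs. (14) and (15). The sketch below assumes the reading $H_{ap}/N^2$ as a function of $\alpha = p/N$ with the two branches joined at $\alpha = 1$; the function names are hypothetical.

```python
import math


def scaled_energy(alpha):
    """H_ap / N^2 from Eqs. (14) and (15), with alpha = p / N."""
    c = math.pi / 4
    if alpha <= 1.0:
        # alpha - c * (1 - exp(-alpha / c)), written with expm1 for small alpha
        return c * (alpha + c * math.expm1(-alpha / c))
    head = c * (1.0 - c * (math.exp(-(alpha - 1.0) / c) - math.exp(-alpha / c)))
    return head + 0.5 * (alpha - 1.0) ** 2


def merit_factor_theory(alpha):
    """F = p^2 / (2 H_ap) expressed in the scaled length alpha."""
    return alpha ** 2 / (2.0 * scaled_energy(alpha))


for alpha in (0.01, 0.5, 1.0, 2.0, 3.0):
    print(alpha, round(merit_factor_theory(alpha), 3))
```

For $\alpha \to 0$ the expression tends to $F = 1$, the value for a random sequence, and it passes through a broad maximum at $\alpha$ of order 2 before decreasing again.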



We observed in Section I A that individual learning rates can make $C_j$ decay faster for some
$j$ and slower for others. The search for a minimal $H_{ap}$ can be written as an optimization
problem in the continuous function $\eta$, where $\eta_j = \eta(j/N)$. Solving this problem with
a variational approach, one finds that it is sensible to give the last 41% of the weights a
learning rate and norm of zero, and increase the learning rate continuously towards components
with smaller indices. Unfortunately, even this optimization does not improve the merit factor
beyond $F = 1.74$ in theory and $\langle F \rangle = 1.86$ in simulations. This is still a lot
worse than the results of other optimization methods [10, 12], so the CBG is not a competitive
generator of low-autocorrelation sequences. Nevertheless, it has some interesting
possibilities: 



D. Shaping the autocorrelation function 

Being able to suppress autocorrelations, the CBG is also capable of controlling the shape of
the autocorrelation function in the long-time limit. Using Eqs. (6) and (9) in the limit where
$\mathbf{w}^0 \to 0$ and for non-negative learning rates, one can obtain the inverse relation
between the square of the autocorrelation function $(C_j)^2$ and the corresponding learning
rate $\eta_j$: 

$$\eta_j \propto \frac{1}{\langle (C_j)^2 \rangle}. \tag{16}$$
Thus, any desired shape of the autocorrelation function is achievable by using the appropriate
profile for $\eta_j$, which can be extracted from Eq. (16). High performance of the CBG as a
producer of sequences with specific desired shapes of the autocorrelation function is observed
in simulations. This feature of the CBG is demonstrated in Figs. 6 and 7, where both an
exponential and a polynomial profile of the autocorrelations are successfully generated. The
slight deviations from the target profile are probably due to a violation of the assumption of
random patterns. Simulations are done for a CBG with 30 input units. The autocorrelations are
calculated for time windows of 100000 bits and are averaged over 1000 such successive windows.
For a wide variety of shapes, the CBG achieves the expected profiles reasonably well. It could
be used as an alternative mechanism for generating colored binary sequences using local rules
instead of nonlocal mechanisms such as Fourier transforms. 
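The mapping between learning rates and target profile can be sketched numerically. The prefactor used below is not copied from Eq. (16); it is an assumption obtained by repeating the random-pattern calculation of Section I A with individual learning rates and $\mathbf{w}^0 \to 0$, which gives $\langle (C_j)^2 \rangle = \pi N \bar{\eta} / (8 \eta_j)$ with $\bar{\eta}$ the mean learning rate. The function names are hypothetical.

```python
import math


def target_rms_autocorrelation(eta_j, eta_mean, n):
    """RMS value of C_j under the assumptions stated above."""
    return math.sqrt(math.pi * n * eta_mean / (8.0 * eta_j))


def learning_rate_profile(c_target):
    """Learning rates, up to a common factor, for a desired profile |C_j|."""
    return [1.0 / c ** 2 for c in c_target]


# Linear profile eta(x) = x on (0, 1]: the continuum mean is 1/2.
N = 30
print(target_rms_autocorrelation(1.0, 0.5, N))
print(learning_rate_profile([1.0, 2.0, 4.0]))
```

With $N = 30$ and $\bar{\eta} = 1/2$ the prefactor evaluates to about 2.427, the number quoted in the caption of Fig. 7. Only ratios of learning rates matter, since a common factor merely rescales the weights.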
FIG. 7. Profile of the average absolute value of the autocorrelation function $|C_j|$ which was
achieved by a CBG with $N = 30$ input units using $\eta_j = j/N$ (dashed curve). The solid
curve stands for the desired profile, $|C_j| = 2.427 / \sqrt{j/N}$. 

The limited use of the CBG in generating sequences with a high merit factor may be related to
phase space arguments: as seen in Sec. I B, the CBG can still generate exponentially many
different time series depending on initial conditions, whereas there are very few sequences
with the highest achievable merit factors (see [11] for the density of states with cyclic
boundary conditions). The mechanism of the CBG allows for manipulation of the autocorrelation
function only if the constraints on the desired sequence are not too strong; suppressing all of
the elements of $C_j$ on a short time scale is an example of a constraint that is too strong.
On the other hand, choosing a given shape for long-time averages of $C_j$ still allows for many
realizations of the sequence. 
FIG. 6. Profile of the average absolute value of the autocorrelation function $|C_j|$ which was
achieved by a CBG with $N = 30$ input units using $\eta_j = \exp(-2j/N)$ (dashed curve). The
solid curve stands for the desired profile, $|C_j| = A \exp(j/N)$, where
$A = \sqrt{\pi N (1 - e^{-2}) / 16}$. 



E. Distribution of generated sequences 



The structure in $C_j$ shows that the CBG does not generate a random sequence. This becomes
more obvious in a histogram of subsequences generated by the system. Fig. 8 shows the
probability distribution of 8-bit substrings from a run of a CBG with $N = 50$, encoded as
decimal integers. Some strings are strongly suppressed, most notably 0 (binary 00000000), 85
(01010101), 170 (10101010) and 255 (11111111). Other sequences with below-average likelihood
also correspond to "simple" sequences, like 15 (00001111) and 51 (00110011). Continued simple
sequences give high values of some components of the autocorrelation function, which is
unlikely as explained above. 
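The encoding of substrings as integers can be sketched as follows. The mapping of $+1$ to binary 1 and $-1$ to binary 0, with the earliest bit as the most significant digit, is an assumption of this illustration, as is the function name.

```python
from collections import Counter


def substring_histogram(bits, length=8):
    """Relative frequency of each `length`-bit window, keyed by its integer code."""
    counts = Counter()
    for start in range(len(bits) - length + 1):
        code = 0
        for b in bits[start:start + length]:
            code = 2 * code + (1 if b > 0 else 0)   # +1 -> binary 1, -1 -> binary 0
        counts[code] += 1
    total = sum(counts.values())
    return {code: c / total for code, c in counts.items()}


alternating = [1, -1] * 8
print(substring_histogram(alternating))
```

A strictly alternating sequence populates only the codes 85 and 170, which are among the strings the CBG suppresses most strongly.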

The shape of the histogram is the same for all $N$ in both the transient and the cycle.
However, $l$ must be much larger than the number of bins in the histogram. The amplitude of the
deviations from uniform distribution again goes like $1/N$. 

FIG. 8. Histogram of 8-bit subsequences generated by a CBG with $N = 50$, averaged over
$5 \times 10^6$ steps. The y axis gives the probability of a subsequence normalized to 1. 



Ordering histograms in descending rank order often reveals insights into the underlying
processes and phase space structure (see e.g. [15, 16]). In our case, the rank ordered
histogram does not show a power law or other universal behaviour, as seen in Fig. 9. 

One way to explain this histogram is by finding the stationary distribution for a biased random
walk on a De Bruijn graph, as was done in [16, 17]: a subsequence $S$ is followed by 1 with
probability $p_S$ and by 0 with probability $1 - p_S$. It is possible to reproduce the
histogram of the CBG accurately this way; however, one has to take the transition probabilities
$p_S$ for each subsequence from simulations of the CBG. There is no obvious way of calculating
them analytically, and taking random transition probabilities does not reproduce the shape of
Fig. 9. 
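The stationary distribution of such a walk can be obtained by power iteration. This is a minimal sketch with a hypothetical function name; the transition probabilities passed in are placeholders, whereas the text takes them from simulations of the CBG.

```python
def stationary_distribution(p_one, length, iterations=200):
    """Power iteration for a biased walk on the De Bruijn graph of `length`-bit strings.

    p_one[s] is the probability that string s (read as an integer) is followed by a 1.
    """
    size = 2 ** length
    mask = size - 1
    dist = [1.0 / size] * size
    for _ in range(iterations):
        new = [0.0] * size
        for s, weight in enumerate(dist):
            new[((s << 1) | 1) & mask] += weight * p_one[s]      # append a 1
            new[(s << 1) & mask] += weight * (1.0 - p_one[s])    # append a 0
        dist = new
    return dist


uniform = stationary_distribution([0.5] * 16, 4)
biased = stationary_distribution([0.9] * 16, 4)
print(uniform[0], biased[15])
```

With all $p_S = 1/2$ the distribution stays uniform, and with a constant bias it reduces to independent bits, so any structure in the histogram must come from the dependence of $p_S$ on the subsequence.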




FIG. 9. Frequencies of 12-bit subsequences generated by a CBG with $N = 50$, ordered by rank
$k$ and normalized to an average of 1 (solid line). The frequencies reproduced by a Markov
process are also displayed (dotted line), but are indistinguishable from the first curve. 



The CBG may be considered the simplest case of 
a sequence-generating perceptron that deterministically 
changes its direction. The sequence generated by it, while 
complex, has many properties that can be understood at 
least qualitatively, especially those that can be linked to 
the autocorrelation function. It is not by any standard a 
satisfying random bit sequence, and while results derived 
from the assumption of random patterns are usually qual- 
itatively correct, the exact values have to be modified. 



II. THE CONFUSED SEQUENCE GENERATOR 

The simplest generalization of the CBG to a continuous perceptron replaces the sign function in
Eq. (1) by a continuous sigmoidal function: 

$$S^{t+1} = -\mathrm{erf}\left( \beta \sum_{j=1}^{N} w_j^t S^{t-j+1} \right) = -\mathrm{erf}(\beta\, \mathbf{w}^t \cdot \mathbf{x}^t); \tag{17}$$
$$\mathbf{w}^{t+1} = \mathbf{w}^t + \frac{\eta}{N} S^{t+1} \mathbf{x}^t, \tag{18}$$

where the only new quantity is the amplification $\beta$, and $h^t$ is an abbreviation of
$\mathbf{w}^t \cdot \mathbf{x}^t$. The generation of a time series by a continuous perceptron
with fixed weights was studied in a number of publications [18]-[20], in which the system was
named Sequence Generator (SGen). We will call the mapping defined by Eqs. (17) and (18)
Confused Sequence Generator (CSG). 
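One step of the map given by Eqs. (17) and (18) can be written compactly. The function name and the storage order of the window (newest output first) are assumptions of this sketch.

```python
import math


def csg_step(w, x, eta, beta):
    """One step of Eqs. (17) and (18); x[j] holds S^{t-j}, newest output first."""
    n = len(w)
    h = sum(wj * xj for wj, xj in zip(w, x))
    s_new = -math.erf(beta * h)                                   # Eq. (17)
    w_new = [wj + eta / n * s_new * xj for wj, xj in zip(w, x)]   # Eq. (18)
    x_new = [s_new] + list(x[:-1])
    return w_new, x_new


w, x = [1.0, 0.0], [0.5, -0.5]
for _ in range(3):
    w, x = csg_step(w, x, eta=1.0, beta=2.0)
    print(x[0], w)
```

Starting from an all-zero window, the output and the weights never change, which is the trivial solution $S = 0$ discussed below.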

The SGen with fixed weights has a critical amplification $\beta_c$ that depends on
$\mathbf{w}$, below which $S = 0$ is the attractive fixed point. Above $\beta_c$, the zero
solution becomes repulsive, and the SGen generates a periodic or quasiperiodic time series with
an attractor dimension of 0 or 1 for most choices of $\mathbf{w}$ and $\beta$ [18]. This
attractor is robust to noise [19]. The time series displays chaotic behaviour only for very
special choices of $\mathbf{w}$ and $\beta$ ("fragile chaos") if the transfer function is
monotonic, and for generic initial conditions ("robust chaos") only if it is non-monotonic
[20]. We will compare these properties to those of the CSG. 



A. Mean-field solution for w 



Similar to the CBG, the weight vector of the CSG does a directed random walk near the surface
of a hypersphere of radius $w$. Unlike the CBG, the length of the learning steps depends on the
magnitude of the output, which in turn depends on $w$ and the outputs in previous timesteps. To
find an approximate solution to this self-consistency problem, we will first ignore
correlations between patterns and weights and treat the patterns as random and independent. In
this approach, the inner field $h$ is a Gaussian random variable of mean 0 and variance
$w^2 S^2$, where $S^2 = \langle (S^t)^2 \rangle_t$ is the mean square output of the system. 

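The norm $w$ is found by taking the square of Eq. (18): 

$$(\mathbf{w}^{t+1})^2 = (\mathbf{w}^t)^2 - \frac{2\eta}{N}\, \mathbf{w}^t \cdot \mathbf{x}^t\, \mathrm{erf}(\beta\, \mathbf{x}^t \cdot \mathbf{w}^t) + \frac{\eta^2}{N^2} (S^{t+1})^2\, \mathbf{x}^t \cdot \mathbf{x}^t, \tag{19}$$

and averaging over the input patterns. The self-overlap $\mathbf{x} \cdot \mathbf{x}$ is on the
average $N S^2$, so the fixed point of $w$ is given by 

$$2 \langle h\, \mathrm{erf}(\beta h) \rangle = \eta S^4. \tag{20}$$

The average on the left hand side can be evaluated and leads to 

$$\frac{4}{\sqrt{\pi}} \frac{\beta w^2 S^2}{\sqrt{1 + 2\beta^2 w^2 S^2}} = \eta S^4, \quad \text{or} \tag{21}$$
$$w^2 = \frac{\pi \eta^2 \beta S^6 + \eta S^2 \sqrt{\pi} \sqrt{16 + \pi \beta^2 \eta^2 S^8}}{16 \beta}. \tag{22}$$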



Let us now turn to $S^2$. The probability distribution of $S$ itself is rather awkward, since
it involves inverse error functions, and its slope diverges at $S = \pm 1$. However, $S^2$ can
be easily calculated by using the distribution of $h$: 

$$\langle S^2 \rangle = \int_{-\infty}^{\infty} \mathrm{erf}^2(\beta h)\, (2\pi w^2 S^2)^{-1/2} \exp\left( -\frac{h^2}{2 w^2 S^2} \right) dh = \frac{2}{\pi} \arcsin\left( \frac{2\beta^2 w^2 S^2}{1 + 2\beta^2 w^2 S^2} \right). \tag{23}$$

Plugging $w^2(\eta, \beta, S^2)$ from Eq. (22) into Eq. (23) and solving numerically, one
obtains a self-consistent solution for $S^2$. A closer look at the equations reveals that if a
new quantity $\gamma = \eta\beta$ is introduced, only $\gamma$ enters into the equation for
$S^2$, and $w^2$ is of the form $w^2 = \eta^2 \tilde{w}(\gamma)$, so only one curve must be
considered. This is intuitive, since a higher $\eta$ eventually leads to a higher $w$, which
has the same effect on $S^2$ as having a smaller $w$, but multiplying
$\mathbf{w} \cdot \mathbf{x}$ with a higher factor $\beta$. 

The map defined by (17) and (18) always has the trivial solution $S = 0$. Only for a
sufficiently high $\gamma > \gamma_c$ are the outputs high enough to sustain a nonvanishing
solution. Note that $S = 0$ is always an attractive solution for all $\gamma < \infty$, but its
basin of attraction becomes smaller for larger $\gamma$. 

The numerical solution of Eqs. (21) and (23) shows that the system undergoes a saddle-node
bifurcation at $\gamma_c = 5.785$, which is in good agreement with simulations. Above
$\gamma_c$, two new fixed points exist, only one of which is stable. While for $S^2(\gamma)$
excellent agreement is found between theory and simulation (see Fig. 10), $w^2(\gamma)$ shows
quantitative differences which are caused by correlations between $\mathbf{x}$ and
$\mathbf{w}$: the mean square overlap $\langle (\mathbf{x} \cdot \mathbf{w})^2 \rangle$ turns
out to be $(1.22 \pm 0.01)\, w^2 S^2$ instead of $w^2 S^2$ as expected for random patterns.
This causes a factor of roughly 0.82 between the theoretical and observed value of $w^2$ seen
in Fig. 10. The same factor is found in the CBG. 

For large $\gamma$, $S^2$ goes to 1 (as it should, since the system is identical to the CBG if
$\gamma = \infty$), and the theoretical prediction for $w$ goes to $\sqrt{\pi/8}\, \eta$, just
like in the CBG. 

FIG. 10. Mean-field solution of the CSG, compared to simulations with $N = 400$. 
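The bifurcation point can be located numerically with a few lines of code. The sketch below inserts Eq. (22) into Eq. (23), writes $q$ for $S^2$, and uses the fact that only $\gamma = \eta\beta$ appears; the grid search, the bisection and all function names are choices made for this illustration.

```python
import math


def mean_square_output(q, gamma):
    """Right-hand side of Eq. (23) after inserting Eq. (22); q stands for S^2."""
    root = math.sqrt(math.pi) * math.sqrt(16.0 + math.pi * gamma ** 2 * q ** 4)
    a = gamma * q * (math.pi * gamma * q ** 3 + q * root) / 8.0   # 2 beta^2 w^2 S^2
    return 2.0 / math.pi * math.asin(a / (1.0 + a))


def has_nonzero_solution(gamma, grid=4000):
    """True if q = mean_square_output(q, gamma) has a root with q > 0."""
    return any(mean_square_output(k / grid, gamma) >= k / grid
               for k in range(1, grid + 1))


def critical_gamma(lo=1.0, hi=20.0, tol=1e-4):
    """Bisection for the smallest gamma that admits a nonvanishing solution."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if has_nonzero_solution(mid):
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)


print(critical_gamma())
```

For small $q$ the right-hand side grows like $q^2$, so $q = 0$ is stable for every finite $\gamma$; a nonzero solution first appears where the curve touches the diagonal, which is the saddle-node bifurcation.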



B. The CSG of mth degree - CSGm 



Multi-spin interactions were studied in fields like neural networks [21], low-autocorrelated
sequences [10] and error-correcting codes [22]. The idea to include multi-spin interactions in
our work originated as an attempt to improve the suppression of the autocorrelation function
achieved by the CBG. The existence of four-spin interactions in the Bernasconi model implies
that a CBG with multi-spin interaction might be useful in the construction of
low-autocorrelated sequences. However, it turns out that a CBG with multi-spin interactions
suppresses the corresponding multi-spin correlations instead. 

In this section we apply multi-spin interactions to the CSG and define the CSGm, namely a CSG
in which each weight component $w_j$ connects the corresponding input unit $x_j$ not only to
the output but also to $m - 2$ additional input units. 



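Assigning $A_j^k$, $k = 1, \ldots, m - 2$, to be the labels of the $m - 2$ additional input
units participating in the $j$th interaction, the dynamics of a CSGm is given by: 

$$S^{t+1} = -\mathrm{erf}\left( \beta \sum_{j=1}^{N} w_j^t S^{t+1-j} \prod_{k=1}^{m-2} S^{t+1-A_j^k} \right); \tag{24}$$
$$w_j^{t+1} = w_j^t + \frac{\eta}{N} S^{t+1} S^{t+1-j} \prod_{k=1}^{m-2} S^{t+1-A_j^k}. \tag{25}$$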
A similar calculation under the same assumptions which are used to yield the solution of the
original CSG gives the following general set of equations: 

$$w^2 = \frac{\pi \eta^2 \beta S^{2(m+1)} + \eta S^2 \sqrt{\pi} \sqrt{16 + \pi \beta^2 \eta^2 S^{4m}}}{16 \beta}; \tag{26}$$
$$\langle S^2 \rangle = \frac{2}{\pi} \arcsin\left( \frac{2\beta^2 w^2 S^{2(m-1)}}{1 + 2\beta^2 w^2 S^{2(m-1)}} \right). \tag{27}$$



Solving Eqs. (26) and (27) numerically for a large range of $m$ values, both the bifurcation
point $\gamma_c$ and the first non-zero values of $S^2$ and $w^2$ were found to increase with
$m$. For $m \to \infty$ one can easily show that $\gamma_c \to \infty$ while for the
non-vanishing solution $w^2 \to 1$ and $S^2 \to 1$. Aiming to study the asymptotic behavior of
$S$ and $\gamma_c$ in the large $m$ limit, we set $S^2 = 1 - \epsilon$ and find that $\epsilon$
must decay to zero at least as $1/m$ in order for a non-zero solution to exist. This inverse
relation between $\epsilon$ and $m$ derives from the $S^m$ terms, since
$(1 - \epsilon)^m \to 0$ unless $\epsilon \lesssim 1/m$. Inserting $S^2 = 1 - \epsilon$ in
Eqs. (26) and (27) and expanding the resulting expression to a power series in $\epsilon$, the
inverse relation between $\epsilon$ and $m$ leads to a linear increase of $\gamma_c$ as a
function of $m$. The numerical solutions of the system in the large $m$ regime support the
linear behavior of $\gamma_c$ as derived from the aforementioned analysis (Fig. 11). 



Fig. 12 compares the numerical solution to the simulation results for a system with $m = 3$.
This agreement between analytics and simulations is observed for larger $m$ as well. 

FIG. 11. The linear relation between $\gamma_c$ and $m$ derived from numerical solution of
Eqs. (26) and (27). 

FIG. 12. Mean-field solution of the CSG3, compared to simulations with $N = 2000$. 



C. Autocorrelation function 

The relation Eq. (9) that links the autocorrelation function to the weights still holds for the
CSG. Since the weight vector is bounded in the CSG as well, the same argument can be given for
the suppression of the first $N$ values of the autocorrelation function. Correspondingly,
$C_j/p$ is almost indistinguishable from that of the CBG shown in Fig. 4. 



D. Cycles and attractors 



The CSG can be seen as a nonlinear mapping that maps the vector $(\mathbf{x}^t, \mathbf{w}^t)$
onto $(\mathbf{x}^{t+1}, \mathbf{w}^{t+1})$. This is in contrast to previous work on the SGen
[20], where the weights were fixed and could be considered parameters of the model rather than
dynamic variables. The only real control parameter of this mapping is $\gamma$. 

Since both the sequence and the weights now live in a high-dimensional space of real numbers,
the CSG can display a wide variety of behaviours, depending on $N$ and $\gamma$: 

For $\gamma < \gamma_c$, the zero solution is the only attractor, and the system will quickly
reach $\mathbf{x}^t = 0$ and stop developing. 

For $\gamma$ slightly above $\gamma_c$, an irregular-looking time series with the statistical
properties calculated in Section II A and displayed in Fig. 10 is generated. However, the zero
solution is still attractive, and after some time the system will drift close to it and stay
there, i.e. the irregular behaviour is due to a chaotic transient rather than a proper chaotic
attractor. 

The survival time on the transient increases dramatically with increasing $N$ and $\gamma$. It
is hard to decide from numerical results whether the average survival time
$\langle t_s \rangle$ diverges with a power law in $|\gamma - \gamma_d|$, as one usually finds
in scenarios where a chaotic transient becomes a chaotic attractor [23], or whether
$\langle t_s \rangle$ increases exponentially with $\gamma$. In either case, the system shows
chaotic behaviour for sufficiently long times to get stable numerical results: for example, for
$N = 20$ and $\gamma = 7.0$, the average survival time is on the order of $10^6$ steps. 

If $\gamma$ is larger than some critical value that depends on $N$, the chaotic transient can
eventually end in a cycle that is related to a possible cycle of the discrete CBG. By "related"
we mean that $S^t$ in the CSG is very close to $\pm 1$ and that clipping the sequence to the
nearest value of $\pm 1$ would give the equivalent attractor of the CBG. More different cycles
become stable with higher $\gamma$; however, the cycle lengths are usually of order $2N$: short
cycles are apparently more likely to become stable than ones whose length is of order $2^N$. 

At amplifications $\gamma$ slightly below the lowest $\gamma$ for which the first cycle becomes
stable for a given $N$, intermittent behaviour is observed: both $S^t$ and $\mathbf{w}$ stay
near a cycle for an extended number of steps (typically several thousand steps for $N = 6$)
before returning to chaotic behaviour for a similar time. An example of this is given in
Fig. 13. 

FIG. 13. Example of intermittent behaviour for $N = 6$, $\gamma = 8$. From top to bottom:
output $S^t$, norm of weights $w$ and largest "one-step Lyapunov exponent"
$\max(\ln|\lambda|)$ (see Section II E). 



E. Stability and Lyapunov exponents 



The term 'chaotic' was used in Sec. II D to describe the irregular time series generated by the
CSG. We will now show that the system is in fact chaotic in the strict sense. 

The sensitivity of trajectories of the map (17), (18) to small changes in the initial
conditions can be tested by calculating the eigenvalues of the Jacobi matrix 

$$M^t = \begin{pmatrix} \partial x_i^{t+1} / \partial x_j^t & \partial x_i^{t+1} / \partial w_j^t \\ \partial w_i^{t+1} / \partial x_j^t & \partial w_i^{t+1} / \partial w_j^t \end{pmatrix}. \tag{28}$$

This is to be understood as a $2N \times 2N$ matrix with indices $i$ and $j$ running from 1 to
$N$. The entries of this matrix are of the following form: 



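$$\begin{aligned}
\frac{\partial x_1^{t+1}}{\partial x_j^t} &= -\beta w_j \frac{2}{\sqrt{\pi}} \exp(-\beta^2 h^2); \\
\frac{\partial x_i^{t+1}}{\partial x_j^t} &= \delta_{i-1,j} \quad \text{for } i = 2, \ldots, N; \\
\frac{\partial x_1^{t+1}}{\partial w_j^t} &= -\beta x_j \frac{2}{\sqrt{\pi}} \exp(-\beta^2 h^2); \\
\frac{\partial x_i^{t+1}}{\partial w_j^t} &= 0 \quad \text{for } i = 2, \ldots, N; \\
\frac{\partial w_i^{t+1}}{\partial x_j^t} &= \frac{\eta}{N} \left( S^{t+1} \delta_{ij} - \beta w_j x_i \frac{2}{\sqrt{\pi}} \exp(-\beta^2 h^2) \right); \\
\frac{\partial w_i^{t+1}}{\partial w_j^t} &= \delta_{ij} - \frac{\eta}{N} \beta x_j x_i \frac{2}{\sqrt{\pi}} \exp(-\beta^2 h^2).
\end{aligned} \tag{29}$$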



If |βh| is large and the transfer function is saturated, the
exponential terms in Eq. (29) are negligible. In that case,
the upper left section of M is occupied only on the first
lower off-diagonal, and the lower right section is the N × N
identity matrix. Since the upper right section is identically
0, the lower left part does not enter into the calculation
of the eigenvalues either.

This simplified matrix has N eigenvalues λ = 0 and N
eigenvalues λ = 1. The eigenvectors of the latter span the
space of weight vectors, where small changes to w^t are
transferred unmodified to w^{t+1}. The eigenvalues λ = 0
all have the same eigenvector, whose only nonvanishing
component is x_N^t, the component of the sequence vector
that is rotated out at t + 1. This means that the
eigenvectors do not span the whole space and thus that the
eigenvalues are not a reliable measure of the propagation
of a disturbance in the system.
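
The structure of the saturated matrix can be checked numerically. The following is a minimal sketch in Python with NumPy (our choice of tools, not the authors'). The function name and the value N = 4 are purely illustrative, and the lower left block is set to zero as a simplifying assumption, since it does not affect the eigenvalues.

```python
import numpy as np

def saturated_jacobian(n):
    """2n x 2n Jacobi matrix in the saturated limit (exponential terms dropped)."""
    shift = np.eye(n, k=-1)        # upper left: ones on the first lower off-diagonal
    zero = np.zeros((n, n))        # upper right: identically zero
    lower_left = np.zeros((n, n))  # assumption: set to zero, it cannot change the eigenvalues
    return np.block([[shift, zero], [lower_left, np.eye(n)]])

n = 4
m = saturated_jacobian(n)
upper_left = m[:n, :n]

# The upper left block is nilpotent, so all n of its eigenvalues are 0.
nilpotent = not np.any(np.linalg.matrix_power(upper_left, n))

# Its null space is one-dimensional: one eigenvector for the n-fold eigenvalue 0.
null_dim = n - np.linalg.matrix_rank(upper_left)

# That eigenvector is the unit vector along x_N, the component rotated out.
e_last = np.zeros(2 * n)
e_last[n - 1] = 1.0
maps_to_zero = not np.any(m @ e_last)

print(nilpotent, null_dim, maps_to_zero)
```

Because the eigenvalue 0 is n-fold but has only one eigenvector, the matrix is defective, which is the reason given above for distrusting the one-step eigenvalues.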

If |βh| is small enough for the exponential terms to
have an appreciable effect, the effect on the eigenvalues
is not easy to calculate. By using values of x and w taken
from a run of the simulation and numerically calculating
the eigenvalues, we find that typically one of the λ = 0
eigenvalues is changed drastically and may have an
absolute value |λ| > 1. This corresponds to a strong
susceptibility of the newly generated sequence component S^t
to small changes in w or x. The other eigenvalues only
undergo small corrections, corresponding to the feedback
of the new component to the weights.
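
A numerical evaluation of this kind can be sketched as follows. This is a hedged illustration, not the authors' code: it assumes the map has the form S^t = -erf(βh) with h = Σ_j w_j x_j, that the sequence vector is shifted by one position, and that the weights are updated by (η/N) S^t x. The function names `step` and `jacobian`, the parameter values β = 2, η = 4 and the random state (x, w) are all hypothetical choices made for the example.

```python
import math
import numpy as np

def step(x, w, beta, eta):
    """One step of the assumed map: new output, shifted sequence, Hebbian update."""
    n = len(x)
    s = -math.erf(beta * float(w @ x))      # assumed form of the new component S^t
    x_new = np.concatenate(([s], x[:-1]))   # oldest component x_N is rotated out
    w_new = w + (eta / n) * s * x           # assumed learning step
    return x_new, w_new

def jacobian(x, w, beta, eta):
    """Analytic 2N x 2N Jacobi matrix of `step` at the point (x, w)."""
    n = len(x)
    h = float(w @ x)
    g = 2.0 / math.sqrt(math.pi) * beta * math.exp(-(beta * h) ** 2)  # d erf(beta*h)/dh
    s = -math.erf(beta * h)
    m = np.zeros((2 * n, 2 * n))
    m[0, :n] = -g * w                       # d x_1' / d x_j
    m[0, n:] = -g * x                       # d x_1' / d w_j
    m[1:n, :n - 1] = np.eye(n - 1)          # d x_i' / d x_j = delta_{i-1,j}
    m[n:, :n] = (eta / n) * (s * np.eye(n) - g * np.outer(x, w))  # d w_i' / d x_j
    m[n:, n:] = np.eye(n) - (eta / n) * g * np.outer(x, x)        # d w_i' / d w_j
    return m

# Arbitrary illustrative state with small |beta*h|, so the exponential terms matter.
rng = np.random.default_rng(0)
n, beta, eta = 6, 2.0, 4.0
x = rng.uniform(-1.0, 1.0, n)
w = rng.normal(0.0, 0.1, n)

m = jacobian(x, w, beta, eta)
eigvals = np.linalg.eigvals(m)
print("largest |lambda|:", np.max(np.abs(eigvals)))
```

In an actual study, x and w would be taken from a simulation run rather than drawn at random; the random state here only exercises the formulas.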

During the regular phases of intermittent behaviour,
the largest eigenvalues of the one-step matrix are
significantly smaller than during the chaotic bursts (see
Fig. 13), corresponding to sequence values that are close to
S = ±1, and thus a nearly saturated transfer function.

To find the Lyapunov exponents of the map (see e.g.
[24]), it is necessary to consider the development of a
small perturbation over a long time, i.e. calculate the
eigenvalues Λ_i^T of the product ∏_{t=1}^{T} M_t of the
Jacobi matrices along the trajectory (of course, the
trajectory is determined using the full nonlinear map).
The Lyapunov exponents are then defined as

$$
\lambda_i = \lim_{T \to \infty} \frac{1}{T} \ln \left| \Lambda_i^{T} \right|
\tag{30}
$$



The straightforward calculation of the product of Jacobi
matrices brings many numerical problems, which can be
eliminated by applying a Gram-Schmidt orthonormalization
procedure to the columns of the product matrix at
regular intervals, as described in [25]. With this
procedure, it is possible to average over T > 100N and get
numerically stable results. The largest Lyapunov
exponent is displayed in Fig. 14. Typically, there are N/2
positive exponents.
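
The re-orthonormalization idea can be illustrated with a short sketch. It makes several assumptions that are not from the paper: NumPy's QR decomposition is used in place of an explicit Gram-Schmidt step (both produce an orthonormal basis and the growth factors), the re-orthonormalization is done at every step rather than at longer intervals, and the Hénon map serves as a small stand-in system so that the result can be compared with well-known values. All function names are hypothetical.

```python
import numpy as np

def lyapunov_spectrum(step, jacobian, z0, n_steps, n_transient=1000):
    """Estimate all Lyapunov exponents of the map z -> step(z).

    jacobian(z) must return the Jacobi matrix of `step` at z.
    """
    z = np.asarray(z0, dtype=float)
    for _ in range(n_transient):       # discard the transient
        z = step(z)
    q = np.eye(z.size)                 # orthonormal set of perturbation vectors
    log_growth = np.zeros(z.size)
    for _ in range(n_steps):
        q = jacobian(z) @ q            # propagate perturbations with the linearised map
        z = step(z)                    # the trajectory itself uses the nonlinear map
        q, r = np.linalg.qr(q)         # re-orthonormalise; |diag(r)| are the growth factors
        log_growth += np.log(np.abs(np.diag(r)))
    return log_growth / n_steps

# Stand-in example (not the system studied here): the Henon map.
A, B = 1.4, 0.3

def henon_step(z):
    return np.array([1.0 - A * z[0] ** 2 + z[1], B * z[0]])

def henon_jacobian(z):
    return np.array([[-2.0 * A * z[0], 1.0], [B, 0.0]])

exponents = lyapunov_spectrum(henon_step, henon_jacobian, [0.0, 0.0], 20000)
print(exponents)
```

A useful sanity check is that the exponents sum to the average of ln|det M_t|, which for the Hénon map is ln 0.3 at every step. A singular Jacobi matrix, as in the fully saturated limit discussed above, would give a growth factor of zero and an exponent of minus infinity, so a real implementation would have to handle that case explicitly.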
FIG. 14. Lyapunov exponents measured from the time
development of perturbations and from iterating the Jacobi
matrix (Eq. (30)), for γ = 8 and γ = 12.



The Kaplan-Yorke conjecture [26] states that there is
a connection between the dimension D of an attractor of
a map and the spectrum of Lyapunov exponents, which
are here assumed to be ordered (λ_1 > λ_2 > ... > λ_{2N}):

$$
D_{KY} = k + \frac{\sum_{i=1}^{k} \lambda_i}{\left| \lambda_{k+1} \right|}
\tag{31}
$$

where k is the value for which \sum_{i=1}^{k} λ_i ≥ 0 and
\sum_{i=1}^{k+1} λ_i < 0. Applying this to the spectrum of
exponents derived from Eq. (30) gives an average attractor
dimension between 1.1N and 1.2N, slightly depending on γ.
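
As a worked illustration of Eq. (31), the following sketch computes the Kaplan-Yorke dimension from a list of exponents. The function name and the handling of the edge cases (all partial sums non-negative, or a negative largest exponent) are our own assumptions; the example spectra are made up and are not data from this paper.

```python
def kaplan_yorke_dimension(exponents):
    """Kaplan-Yorke dimension D = k + sum_{i<=k} lambda_i / |lambda_{k+1}|."""
    lam = sorted(exponents, reverse=True)   # order the spectrum, largest first
    partial = 0.0
    for k, value in enumerate(lam):
        if partial + value < 0.0:
            # `partial` is the sum of the first k exponents (still non-negative),
            # and `value` is lambda_{k+1}, the first one that makes the sum negative.
            return k + partial / abs(value)
        partial += value
    # Assumption: if the sum never becomes negative, return the full dimension.
    return float(len(lam))

print(kaplan_yorke_dimension([0.5, 0.0, -1.0]))
```

For the made-up spectrum (0.5, 0, -1), the partial sums are 0.5, 0.5 and -0.5, so k = 2 and D = 2 + 0.5/1 = 2.5.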

An alternative method for measuring the largest
Lyapunov exponent is to start two trajectories with
infinitesimally different initial conditions and to
propagate both of them using the nonlinear map. At regular
intervals, the distance between the trajectories is
measured and stored, and then reset to its initial value
while keeping the direction of the distance vector. The
advantage of this method is that it requires only O(N)
calculations per time step, rather than O(N^2) as in the
previous method, which makes much larger N accessible.
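
The two-trajectory procedure can be sketched as below. The sketch rests on assumptions of ours: the distance is rescaled after every step, the initial separation `d0` is 1e-8, and the logistic map at r = 4 (whose Lyapunov exponent is ln 2) is used as a one-dimensional stand-in for the system studied here. The function names are hypothetical.

```python
import math
import numpy as np

def largest_lyapunov(step, z0, n_steps, d0=1e-8, n_transient=1000, seed=0):
    """Largest Lyapunov exponent from two nearby trajectories of z -> step(z)."""
    rng = np.random.default_rng(seed)
    z = np.asarray(z0, dtype=float)
    for _ in range(n_transient):            # discard the transient
        z = step(z)
    delta = rng.normal(size=z.shape)        # random initial direction of length d0
    delta *= d0 / np.linalg.norm(delta)
    z_pert = z + delta
    log_sum, counted = 0.0, 0
    for _ in range(n_steps):
        z, z_pert = step(z), step(z_pert)   # both use the full nonlinear map
        delta = z_pert - z
        dist = float(np.linalg.norm(delta))
        if dist == 0.0:
            # Trajectories merged numerically: restart with a fresh direction.
            delta = rng.normal(size=z.shape)
            delta *= d0 / np.linalg.norm(delta)
        else:
            log_sum += math.log(dist / d0)  # store the growth over this interval
            counted += 1
            delta *= d0 / dist              # reset the length, keep the direction
        z_pert = z + delta
    return log_sum / counted

# Stand-in example (not the system of this paper): logistic map at r = 4.
def logistic_step(z):
    return 4.0 * z * (1.0 - z)

estimate = largest_lyapunov(logistic_step, [0.3], 20000)
print(estimate, math.log(2.0))
```

Each step costs only two evaluations of the map and a few vector operations, which is the source of the O(N) scaling mentioned above.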

The results for λ_1 are also displayed in Fig. 14: the
values gained by the two methods agree well within the
numerical errors. For large N, λ_max decreases as 1/N, i.e.
perturbations grow on the α-timescale of online learning.



III. SUMMARY 

In this paper, we have studied the properties of a time 
sequence generated by a perceptron which learns the op- 
posite of its own prediction. In the case of the simple 
perceptron, some properties are accessible analytically 
through the application of online learning techniques and 
through the connection between the weights and the au- 
tocorrelation function of the sequence. The distribution 
of learning rates among the weight components has a 
decisive influence on the statistical properties of the gen- 
erated sequence and allows for sequences with a wide 
variety of autocorrelation shapes. 

Due to the discrete nature of the sequence and the
learning algorithm, cycles of the system are inevitable.
We find that their typical length, as well as that of the
transient, grows exponentially with the system size N.

A histogram of substrings of the generated sequence 
reveals that the sequence has significant deviations from 
randomness, although the deviations decrease with in- 
creasing N. 

Replacing the sign function in the update rule by a
continuous sigmoidal function changes many of these
results. A vanishing solution now becomes possible; only
for sufficiently large values of the rescaled amplification
γ can nontrivial solutions survive. The critical γ_c can be
calculated in a mean-field online learning calculation.

Since both sequence and weights are now continuous,
cycles vanish for low values of γ, and the trajectory is a
chaotic sequence. The largest Lyapunov exponent scales
like 1/N for large N; the spectrum of Lyapunov
exponents suggests high-dimensional chaos.

At least some cycles of the CBG reemerge as stable
fixed points of the CSG above a critical γ that is
different for each attractor. Slightly below the smallest
critical γ for a given N, intermittent behaviour is observed.

Compared to the behaviour of sequence-generating 
perceptrons with fixed weights, the sequence generated 
with changing weights shows more complex behaviour: 
longer cycles, more randomness, chaotic as opposed to 
quasiperiodic behaviour. It seems likely that this ten- 
dency also holds for other algorithms in which the weights 
keep changing. 